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Abstract: The spectral line datacubes obtained from the Square Kilometre Array (SKA) and its 
precursors, such as the Australian SKA Pathfinder (ASKAP), will be sufficiently large to necessitate 
automated detection and parametrisation of sources. Matched filtering is widely acknowledged as the 
best possible method for the automated detection of sources. This paper presents the Characterised 
Noise Hi (CNHI) source finder, which employs a novel implementation of matched filtering. This 
implementation is optimised for the 3-D nature of the planned Wide-field ASKAP Legacy L-band All- 
sky Blind surveY's (WALLABY) Hi spectral line observations. The CNHI source finder also employs 
a novel sparse representation of 3-D objects, with a high compression rate, to implement Lutz one-pass 
algorithm on datacubes that are too large to process in a single pass. 

WALLABY will use ASKAP's phenomenal 30 square degree field of view to image ~ 70% of the sky. 
It is expected that WALLABY will find 500 000 Hi galaxies out to z ~ 0.2. 



Keywords: methods: data analysis 
image processing 



methods: statistical — radio lines: galaxies — techniques: 



1 Introduction 

The Wide-field ASKAP Legacy L-band All-sky Blind 
surveY (WALLABYfl ijKoribalski et all ||2009D : Ko- 
ribalski, B., Staveley-Smith, L. et al., in preparation) 
is an ambitious project that aims to detect neutral hy- 
drogen to a redshift of z ~ 0.26, across ~ 70% of the 
sky. It is one of the two top ranked projects that will 
be carried out using the Australian SKA Pathfinder 
(ASKAP). WALLABY is possible because of ASKAP's 
unprecedented ~ 30 sq. degree field-of-view, which is 
achieved using Phased Array Feeds (PAFs). WAL- 
LABY will use all 36 of ASKAP's antennae, but due 
to limitations on computing resources will only pro- 
cess the inner 30 antennae (with a maximum base- 
line of 2km) to image the sky with a 30" synthesised 
beam and produce datacubes with voxels^ that project 
to ~10" on the sky. The high spatial resolution is 
complemented by an anticipated spectral resolution of 
3.86 km s . ASKAP spectral datacubes will therefore 
cover a large area of the sky to high resolution, which 
results in very large datacubes containing at least 2048 
x 2048 x 16 384 voxels. WALLABY will consist of ~ 
1200 of these large datacubes. The size and number of 
these datacubes renders manual source finding unfea- 
sible. The performance of the automatic source finder 
used by WALLABY will determine how many (Hi ) 



Principal Investigators: Baerbel Ko- 

ribalski and Lister Staveley-Smith. See 
www.atnf.csiro.au/research/WALLABY for more de- 
tails about the survey. 

2 Voxels are often referred to as pixels when discussing 
a single channel of a datacube. Technically though these 
'pixels' are still voxels. For this reason the term voxel is 
used throughout instead of pixel to aid consistency. 



galaxies are found by WALLABY. 

The majority of source finders in existence use in- 
tensity thresholding to find sou rces. SExtractor 
(jBertin fc Arnoutsll 19961) . SFind dHopkins et al.ll2002f ) 
and Duchamp ljWhitinell2008l . | 201 lh are good exam- 
ples of source finders based on intensity thresholding. 
Conceptually, intensity threshold source finders check 
every pixel (voxel) in an image (datacube) to see if the 
pixel (voxel) value is sufficiently extreme that it's un- 
likely to be noise. Once all of the source pixels (voxels) 
have been identified, they are combined into objects. 
The various intensity threshold source finders differ in 
how they estimate the noise, set a threshold for iden- 
tifying source pixels (voxels), pre-process the image 
(datacube) to improve the source finder results and 
the manner in which they create objects from source 
pixels (voxels). All intensity threshold source finders 
share an inherent limitation though. 

Consider an arbitrary source in a spectral datacube. 
Improved spatial and spectral resolutions result in the 
source occupying more voxels in the datacube. Dis- 
persing the source's signal over more voxels means that 
it contributes less to the flux value of each voxel that it 
occupies. This makes it harder for an intensity thresh- 
old method to detect the source. Using a simple model 
this effect is illustrated in Figure [1] where the maxi- 
mum voxel S/N of an object with an integrated S/N 
of 5 is plotted for various asymmetries. The maximum 
voxel S/N is calculated to be S / Ni nteg rated X fi/^/n, 
where /3 describes the asymmetry of the object's flux 
distribution and n is the number of voxels. 

By overlaying the minimum expected size (in vox- 
els) of WALLABY sources on Figure [T] we can assess 
the impact of this inherent limitation on WALLABY. 
The neutral hydrogen detected in emission is warm, 
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which gives it an intrinsic amount of dispersion. We 
will assume that any real WALLABY source extends 
over at least 3 channels. We will also assume that in 
every channel, a source occupies at least 3x3 voxels 
for a 30" synthesised beam and 10" voxels). Galaxies 
rarely lie at the middle of a voxel though, so a more re- 
alistic minimum is a grid of 4x4 or 5x5 voxels in every 
channel. It is expected that most galaxies detected by 
WALLABY will be unresolved or at most marginally 
resolved (Duffy, A., Meyer, M. & Staveley-Smith, L. 
2011, in preparation), so we also consider a 7x7 grid 
of voxels in every channel to account for marginally 
resolved, off-centre galaxies. After multiplying by the 
minimum number of channels to obtain the expected 
minimum size of WALLABY galaxies in voxels, the 
minimum size for these different grids is overlaid in 
Figure [1] This demonstrates that off-centre and/or 
marginally resolved galaxies will be difficult to detect 
with a basic implementation of an intensity thresh- 
olding source finder, unless the flux is asymmetrically 
distributed. Figure [T] also illustrates that this effect is 
amplified in 3-D datasets such as future WALLABY 
datacubes. For example, in a 2-D image the 7x7 and 
5x5 vertical lines would approximately lie at the posi- 
tion of the 4x4 and 3x3 vertical lines. 




250 



Object size in voxels 

Figure 1: The maximum voxel S/N of an object 
(with an integrated S/N — 5) plotted against the 
number of voxels comprising the object, n. The 
various lines correspond to different asymmetries 
in the distribution of the voxel's flux over the n 
voxels. The vertical lines (labelled) denote the 
minimum size of a point source extending over 
three channels and occupying 3x3, 4x4, 5x5 or 7x7 
voxels in every channel. 



This inherent limitation is compounded further by 
using a size-based rejection criterion to weed out false 
detections. If you only detect a few, unconnected vox- 
els the source will be flagged as a false positive by a 



size-based rejection criterion. Off-centre and/or marginally 
resolved galaxies that are detected because of asym- 
metric flux distributions are most likely to be detected 
in the form of a few, unconnected voxels. This effect 
will also show up as an enhanced fracture rate of ex- 
tended, well resolved sources. 

The inherent limitation of intensity thresholding 
based source finders can be offset by using more aggres- 
sive intensity thresholds (i.e., lower intensity thresh- 
olds), but it often results in many false (source) de- 
tections. The solution to this problem is to run the 
source finder on the datacube multiple times with the 
datacube smoothed to a different scale each time. This 
is however a very inefficient solution to the problem. 
Duchamp provides multiple options for dealing with 
this inherent limitation: a secondary 'growth' thresh- 
old, smoothing and a 3-dimensional wavelet reconstruc- 
tion of the dataset. 

As part of its design study, WALLABY has inves- 
tigated novel methods of source detection as an al- 
ternative to multiple passes of an intensity threshold 
source finder. The goal of this investigation was to 
develop source detection methods that are optimised 
for large datacubes with high spatial and spectral res- 
olution. The Characterised Noise Hi (CNHI) source 
finder that I present here is one of the novel source 
detection methods that have been developed. 

The rest of the paper is structured as follows. The 
conceptual framework for the CNHI source finder is 
presented in Section [21 The inherent limitations of the 
CNHI source finder are then discussed in Section [3] 
Next, the current implementation of the CNHI source 
finder is presented in Section [3] Finally, some example 
results are discussed in Section [5] before finishing with 
a summary. 

2 Conceptual framework 

The CNHI source finder is based on three concepts. 
The first concept is that WALLABY spectral datacubes 
can be treated as a bundle of Hi spectra, rather than a 
collection of voxels. The next concept is that contigu- 
ous blocks of voxels should be tested to see if they're 
a source, rather than individual voxels. The final con- 
cept is that a 'source' is detected by looking for a region 
in a datacube that doesn't look like noise. This is the 
inverse of most source finders which identify sources 
based on some idea of what a source looks like. In 
the rest of this section these concepts are explained in 
detail. 

The first part of the CNHI source finder conceptual 
framework is treating a WALLABY spectral datacube 
as a bundle of Hi spectra, which is akin to Integral 
Field Unit (IFU) observations. Each position on the 
sky has its own spectrum. Each spectrum in this dat- 
acube is correlated to some degree however with the 
neighbouring spectra. The ASKAP beam will deter- 
mine the degree of correlation between each spectrum 
and its neighbouring spectra. As explained later the 
correlation between spectra should not be a problem 
for the CNHI source finder. This conceptual view of 
a WALLABY datacube is very amenable to paralleli- 
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sation. 

The second component of the CNHI source finder 
conceptual framework is to test contiguous blocks of 
voxels instead of individual voxels. This concept is 
designed to take advantage of WALLABY datacubes 
having sufficiently high velocity resolution to reason- 
ably resolve most galaxies in frequency. As discussed 
above, high resolution data is problematic for source 
finding methods that analyse individual voxels. The 
high resolution is however advantageous when testing 
contiguous blocks of voxels. A single voxel with a flux 
value that is one standard deviation above the mean 
is not significant. Ten contiguous voxels that are all 
one standard deviation above the mean are, because 
it's improbable that this will happen by chance in a 
spectrum with negligible correlation. If we can test 
whether a contiguous block of voxels in a spectrum is 
likely to be source, then we are searching for sources in 
a way that is benefited by the high velocity resolution 
of WALLABY datacubes. 

Testing whether contiguous blocks of voxels are 
sources provides additional information compared to 
testing individual voxels. It is a reasonable expecta- 
tion that the test region that best fits the position and 
velocity width of a source in a given Hi spectrum is the 
most significant test region. If the test region is too 
small, then it should be less significant than a larger 
region that is also made up of source voxels. If the test 
region is too large, then the test region contains both 
source and pure noise voxels, which should result in a 
less significant test region. Identifying the position and 
width that results in the most significant test region, 
therefore provides an estimate of the source position 
and velocity width. 

The final concept of the CNHI source finder is to 
find sources by looking for regions in a WALLABY dat- 
acube that do not look like noise. Looking for regions 
that do not look like noise is a novel way to implement 
matched filtering. Rather than using many, many fil- 
ters that each describe a different type of source, we 
can use a single filter that looks like noise. This works 
because we can safely assume that the presence of a 
source, is what causes a contiguous region in our dat- 
acube to not look like noise. The key is to use the 
noise distribution as the filter, because it is relatively 
stable, even though individual realisations of the noise 
vary. 

How do we use the noise distribution as a filter 
to look for regions that aren't pure noise? Due to 
the large size and high resolution of WALLABY dat- 
acubes they are expected to be sparsely populated by 
sources. This means that an arbitrary Hi spectrum 
will be dominated by noise. If we select a test re- 
gion of contiguous voxels, then we can use the rest of 
the Hi spectrum as an example of noise. A compar- 
ative statj^ticaI_tes_t_sjirA Kolmogorov-Smirnov 
test (|Kendall fc Stuard 1 19791 1 can then determine the 
probability that the test region and the rest of the 
spectrum, which is noise dominated, come from the 
same distribution of voxel fluxes. In other words, us- 
ing a comparative statistical test we can identify if a 
test region looks like noise. This implementation can 
easily be adapted to 2-D images and spectra. 



The CNHI source finder uses the Kuiper test l|Kuiperl 

1960) to compare the voxel flux distribution of a test 
region to the rest of the LoS spectrum, and identify 
regions with non-noise voxel flux distributions. The 
Kuiper test is a variant of the Kolmogorov-Smirnov 
test that is cyclically-invariant. The Kolmogorov-Smirnov 
test is most sensitive to differences in the distributions 
about the medians of the distributions. The Kuiper 
test by contrast is equally sensitive to differences in 
the two distributions throughout their entire range. 

3 Inherent limitations 

There are two inherent limitations of the CNHI source 
finder. These limitations are tied to the conceptual 
framework. The first limitation is that the comparison 
of the test region with the rest of the Hi spectrum 
relies on a noise dominated Hi spectrum. The other 
limitation is that a test region needs to be sufficiently 
large for the Kuiper test to produce reliable results. 

The use of the Kuiper test to determine if a region 
in a Hi spectrum is noise-like, is predicated upon the 
assumption that the voxel flux values of the rest of the 
Hi spectrum are noise. It is expected that the assump- 
tion of noise dominated Hi spectra is valid for properly 
calibrated, flagged and continuum subtracted WAL- 
LABY datacubes with no significant baseline struc- 
ture. To illustrate the sparsity of WALLABY dat- 
acubes, a typical WALLABY datacube is only 0.6% 
source (measured in voxels) if 500 000 sources dis- 
tributed across 1 200 datacubes have a typical size of 
1 000 000 voxels. 

For datacubes that are not well behaved or suffi- 
ciently sparse (e.g., baseline structure or failed band- 
pass calibration), this can be dealt with by compar- 
ing a test region in a Hi spectrum to the subset of 
the remaining Hi spectrum in its immediate vicinity. 
The Kuiper test will be less sensitive, but this is off- 
set by making a valid statistical comparison. Fourier 
analysis, polynomial fitting and existing baseline struc- 
ture removal techniques (such as those implemented in 
Duchamp) can also be used in combination with this 
approach, or as an alternative. 

The validity of a Kuiper test is described by the 
Q parameter. For two samples containing m and ri2 
values, the Q parameter is rn x 712/ '{fix +712). This 
is the same Q parameter that describes the validity 
of the Kolmogorov-Smirnov test. For both the Kuiper 
and Kolmogorov-Smirnov tests it is accepted that the 
test results are valid for Q > 4. The test region in a Hi 
spectrum needs to be sufficiently large to satisfy Q > 
4, otherwise the Kuiper test results are increasingly 
spurious for increasingly smaller test regions. Setting 
n = m +ri2, Q = 4 and letting m be the minimum size 
of a test region, then we can solve for m. The minimum 
size of a test region is m = (n — \/n 2 — 16n) /2. 

For a WALLABY datacube the minimum size of 
a test region is 4 channels. This matches the mini- 
mum expected channel width of WALLABY Hi galax- 
ies. The use of the CNHI source finder on other dat- 
acubes needs to consider the minimum size of a test 
region. For reference the minimum test region size is 
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4 channels for n > 40, and the Kuiper test can not 
achieve Q > 4 for n < 15. 

4 Current implementation 

The current implementation of the CNHI source finder 
works in the following manner. A user calls the CNHI 
source finder from the command line with a list of in- 
put parameters. The software then figures out how 
many chunks to split the input file into, such that 
each chunk is at most 1GB in size. For each chunk, 
the software creates a bundled Hi spectrum for each 
position on the sky and uses the Kuiper test to find 
object sections in the bundled spectrum. Once all of 
the bundled Hi spectra have been searched for object 
sections, the software creates objects out of them using 
a variant of Lutz one pass algorithm (|Lutzi ri980V The 
list of objects, their properties and postage stamp im- 
ages are external to the chunks and new objects are 
added as each chunk is processed. Once all of the 
chunks have been processed, the final list of objects 
is tested against the user specified rejection criterion. 
The objects that remain are then output to a cata- 
logue, added to global moment and position-velocity 
plots and postage stamp images (including integrated 
spectra) generated. A flow diagram of the current im- 
plementation is presented in Figure [5] The following 
subsections provide more detail about the CNHI in- 
put and output, the bundling of multiple Hi spectra, 
finding object sections in a bundled Hi spectrum and 
the creation of objects from object sections. 

4.1 Inputs and outputs 

User input consists of the following parameters: 

1. Output code: The output catalogue and plots 
are created with names of OutputCode_obj.cat, 
OutputCode_plots.ps and OutputCode_spectra.ps 

2. File name: The path to and name of the .fits file 
that CNHI will search for sources. 

3. Pre-threshold: The probability threshold used 
to identify interesting regions that are likely to 
contain a sourcfl (Dimensionless quantity.) 

4. Threshold: The probability threshold used to 
identify sources within interesting regions 3 . (Di- 
mensionless quantity.) 

5. Minimum bounding box (3 values): The size cri- 
terion applied to an object's bounding box if it 
is to be retained. 

6. Bounding box filling factor: An object composed 
of fewer voxels than the minimum bounding box 
multiplied by the filling factor is rejected. 

7. Pseudo total intensity threshold: Objects with 
a pseudo total intensity less than this threshold 
are rejected. 

8. Merging distances (3 values): Objects which are 
separated by this many voxels or fewer in any 
dimension are merged into a single object. 

3 Note that a higher value digs deeper into the noise, and 
is equivalent to using a lower intensity threshold. 



9. Maximum scale: The maximum size of a test 
region in a Hi spectrum. Specified in number of 
channels. 

The CNHI source finder output consists of a cat- 
alogue, a global moment map, a global position- 
velocity diagram for both RA and Dec, and postage 
stamp images of each object (including an integrated 
spectrum). The CNHI source finder catalogue con- 
tains the following information for each object: 

1. ID: A numerical ID assigned to each object. 

2. Voxel count: The number of voxels that consti- 
tute the source. 

3. Pixel RA: The mean RA of the object's voxels. 

4. Pixel Dec: The mean Dec of the object's voxels. 

5. Pixel channel: The mean channel of the object's 
voxels. 

6. Intensity RA: The flux weighted mean RA of the 
object's voxels. 

7. Intensity Dec: The flux weighted mean Dec of 
the object's voxels. 

8. Intensity channel: The flux weighted mean chan- 
nel of the object's voxels. 

9. Voxel limits (6 values): The minimum and max- 
imum RA, Dec and channel of the object. 

10. Voxel flux statistics (5 values): The sum, mean, 
minimum, maximum and standard deviation of 
the object's voxel flux values. 

11. Two sets of W20 and VK50 measurements. 

12. Sparse representation: A sequence of values that 
describes a sparse representation of the object's 
3-D bit mask in the datacubes voxel co-ordinates. 

The W20 and W50 measurements are measured in 
two ways. The first set of W20 and W50 measurements 
uses the same method as Duchamp. First, the global 
maximum of the object's integrated spectrum is deter- 
mined. Next, starting from each end of the integrated 
spectrum the first channel with a flux greater than or 
equal to 20% and 50% of the maximum are identi- 
fied and used to measure W20 and Wso- The second 
set of width measurements was developed as part of 
the source finder framework that is used to implement 
CNHI. First, the total flux of the integrated spectrum 
is measured. The cumulative frequency distribution 
(cfd) of the total flux is then constructed as a function 
of channel number for the integrated spectrum. The 
inner 92.7% and 76.1%, which corresponds to the con- 
ventional VK20 &i W50 for a gaussian profile, are then 
used to determine W20 and W50. The cfd will oscil- 
late at the edges of the integrated spectrum because 
of noise. For this reason the 'inner' values are defined 
(starting from the left edge of the integrated spectrum) 
to start at the channels where the cfd never again dips 
below 0.0364 (W20) and 0.1195 (W 50 ), and ends at the 
channels where the cfd first rises above 0.9636 (W20) 
and 0.8805 (W50). This approach to measuring W20 
and W50 has a consistent physical meaning across all 
possible spectral profiles and 'in principle' averages out 
noise. 
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Get user input. i 



Determine size of input datacube. 

▼ 

Determine how to process the input datacube in 1GB 'chunks'. 
▼ 

Initialise arrays, sparse representations and list of available object IDs. 

▼ 

Determine scaling factor to convert datacube into mJy/beam. 

▼ 

For each 1GB chunk: 



Read datacube into memory. 

▼ 

Convert datacube chunk in memory into mJy/beam. 

T 

Initialise mask for chunk- 

▼ 

Write previously found objects into this chunk's mask- 

▼ 

For each RA.Dec position that doesn't overlap with previously processed chunks: 

[ Create bundled spectrum. 

▼ . 

Find interesting regions, then object sections inside interesting regions. 

▼ ~~ ~ 

| Update mask array with object sections. 

▼ 

Create/update objects from object sections by searching through mask, RA > Dec > channels. 



IF voxel is flagged as part of an object. 


y\ — NO Create new object with lowest available ID. 




Merge properties, sparse reps and postage stamp 




/ Existing \. 


images of all the objects into a single, new 




<^ objs in merging \ MANY 


object using the smallest of the existing IDs. 




\> volume? / 


Release/recycle other IDs, sparse reps and 






postage stamps. 




— ONE Add voxel to existing object. 



5: 



Size threshold objects that have passed out of merging volume of current voxel. 
Release/recycle IDs, sparse reps and postage stamps of objects that fail. 



▼ 

Size (again) and intensity threshold objects. Accounts for overlap between chunks. 

▼ 

Create output catalogue, global plots and postage stamp images. 



Figure 2: The algorithm describing the current implementation of the CNHI source finder is shown here 
as a flow diagram. 
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4.2 Creating bundled Hi spectra 

For each Hi spectrum the CNHI source finder bun- 
dles together this Hi spectrum and the neighbouring 
Hi spectra. This bundled Hi spectrum is searched for 
objects. A bundled spectrum is created by weighting 
the sum of the Hi spectrum and it's neighbouring Hi 
spectra using the point spread function. 

Searching for object sections in a bundled spec- 
trum has two advantages. The first advantage is that 
it slightly improves the S/N without blurring the edges 
of sources in velocity space or creating correlation in 
the bundled Hi spectrum. The second, more impor- 
tant advantage is the improved detection of the outer 
edges of a source. Bundling the Hi spectra couples 
the brighter flux in an object's inner region to the 
fainter flux at the edge of the object. This improves the 
chance of detecting the fainter, outer regions of the ob- 
ject. The amount of improvement varies non-linear ly 
with source morphology, user input and datacube, so 
no attempt is made here to predict the amount of im- 
provement. 

4.3 Finding object sections 

The Kuiper test is used to find object sections within 
a bundled Hi spectrum. This is implemented as a four 
step process. In the first step, the Kuiper test is used 
to identify interesting regions that are likely to contain 
an object section. The next step is to reduce the list of 
interesting regions to a unique set. This is achieved by 
finding all of the interesting regions that overlap, and 
only keeping the most significant interesting region. 
The most significant interesting region is judged to be 
the region with the lowest probability of being noise ac- 
cording to the Kuiper test. The purpose of these first 
two steps is to efficiently reduce the bundled spectrum 
to a manageable subset. Third, the Kuiper test is ap- 
plied to every position on every relevant scale to find 
object sections within these interesting regions. The 
final step is to reduce the object sections to a unique 
set. This is achieved in the same way that interesting 
regions are reduced to a unique set. 

Interesting regions are found by applying the Kuiper 
test to test regions in the bundled Hi spectrum and 
comparing the result to a user defined pre-threshold. 
Interesting regions are found efficiently by Nyquist sam- 
pling both the scales of interest and positions along 
the bundled Hi spectrum. Starting with the largest 
scale of interest (user specified) and the beginning of 
the bundled Hi spectrum, the Kuiper test is applied 
and then the test region is advanced half of the scale 
length. Once the entire bundled spectrum has been 
tested on this scale, the scale is halved and the process 
is repeated. This is repeated until the minimum scale 
has been processed. 

Interesting regions narrow down the location and 
scale of object sections. Only positions within the in- 
teresting region and scales ranging from half as small 
as the interesting region up to the size of the interest- 
ing region, need to be investigated. If other positions 
or scales were a better fit for object sections located 
within this interesting region, then this wouldn't be 



the most significant interesting region. The Kuiper 
test is applied to all possible combinations of position 
and scale efficiently using a test region that expands 
and then shrinks as it moves to different positions. 

4.4 Creating objects from object sec- 
tions 

Once the CNHI source finder has finished finding ob- 
ject sections in a chunk of the datacube, the object sec- 
tions are combined into objects. Objects are created 
from object sections using a variant of Lutz one pass 
algorithm. The essence of Lutz one pass algorithm is 
to raster scan through an image or datacube build- 
ing up the properties of every object as you go. Lutz 
crucial insight is that objects are simply connected so 
they pop out of the current image or datacube section 
being scanned (the scanline). This places a limit on 
the number of objects that need to be tracked and up- 
dated at any one time. Once the entire datacube is 
scanned, the rejection critierion is applied to the ob- 
jects. The surviving objects are written to the output 
catalogue. 

The crucial change to Lutz one pass algorithm is 
the use of sparse representations of 3-D objects. The 
use of sparse representations is what allows the CNHI 
source finder to process the datacube in chunks. A new 
sparse representation, which consists of three compo- 
nents, was developed expressly for this purpose. The 
first component lists the RA and Dec widths of the 
object's bounding box. The second component lists 
the number of object sections that make up the object 
prior to a given RA, Dec position. The final component 
lists the channels that each object section begins and 
ends at. The first component indexes the second com- 
ponent, which then indexes the third component. The 
structure of this sparse representation is illustrated in 
Figure 13 

The flaw of Lutz one pass algorithm is that it as- 
sumes objects are never encountered again after they 
'pop' out of the scanline. This assumption is quite eas- 
ily invalidated when a datacube is too big to process 
in a single pass. A crude solution is to split the dat- 
acube into overlapping chunks, process them individ- 
ually and then merge the detections within the chunk 
overlaps. This approach can easily produce erroneous 
results though, because it relies on the positions of a 
source's segments being sufficiently close to be merged 
together. A better method is to update each chunk's 
mask with all of the previously detected sources be- 
fore processing them. To do this we need to be able to 
efficiently store the binary mask of every object in a 
readily accessible format. The sparse representation of 
3-D objects presented here enables the CNHI source 
finder to do exactly that. This approach to process- 
ing arbitrarily large datacubes also makes the CNHI 
source finder readily amenable to distributed/parallel 
computing. 

An additional benefit of using the sparse represen- 
tations is that the binary mask of every object can 
be written to the CNHI output catalogue. Storing 
an object's binary mask allows us to run an arbitrary 
source parametrisation tool without having to re-run 
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Figure 3: An illustration of the novel sparse rep- 
resentation of 3D objects, that was developed for 
the CNHI source finder. 



the source finder to generate the binary mask. 

The CNHI source finder incorporates the sparse 
representation of 3-D objects in the following way. The 
flag array, which stores the object sections found in the 
datacube chunk currently being processed, is first up- 
dated for the objects found in previous chunks using 
their sparse representations. The CNHI source finder 
then scans through the flag array until it finds a voxel 
that has been flagged as source. The software then 
searches through the flag array to find previously pro- 
cessed objects within the voxel's user-specified merg- 
ing volume. This is a 3-D implementation of what is 
referred to as 8-connected linking in 2-D images. If 
there aren't any objects within the merging volume, 
then a new object and sparse representation is cre- 
ated. If a single existing object is found within the 
merging volume, then the source voxel is added to it. 
If multiple existing objects are found, then the existing 
objects are merged into a single object and the source 
voxel added to it. In each scenario the flag array is 
updated with the object id of the source that each 
source voxel belongs to. When existing objects pass 
out of the merging volume of all possible new objects, 
the user specified rejection criterion is applied to these 
existing objects. This ensures the minimal number of 
objects are stored in memory at any given time. The 
object ids of rejected objects and the memory used to 
store their properties and sparse representations are 
recycled. After the flag array has been scanned, the 
sparse representations and postage stamp images of 
the surviving objects are constructed/updated as re- 
quired. 



5 Example results 

An analysis of the completeness and reliability of the 
CNHI source finder, and a comparison of its perfor- 
mance to other source finders is presented in lPopping et al.l 
(2011). This paper presents a complementary analy- 
sis to Popping et al. (2011), using the same point 
source (PS) and extended so urce (ES) test datacubes 
(jWestmeier fe Poppind [201 il l. The terms complete- 
ness, raw reliability, refined reliability, merging rate 
and fragmenta t ion ra te are defined and measured as in 
i Popping et"aD (l201lh As in l Popping et alJ (|201lT) and 
IWestmeier fc Popping! (|2011h . the analysis presented 
here acknowledges the difference between raw reliabil- 
ity and refined reliability, the refined reliability being 
the reliability that is possible using post-processing to 
remove false detections from a source finder's output 
catalogue. From here on, the term 'reliability' will 
refer to the raw reliability. Note that the refined relia- 
bility has a corresponding refined completeness, which 
accounts for the true detections that are incorrectly 
flagged as false detections during post-processing. 

During visual inspection of the CNHI source finder 
detections in various parameter spaces, it was discov- 
ered that the total intensity versus maximum voxel 
intensity (intensity of the source's brightest voxel) pa- 
rameter space is the most efficient parameter space in 
which to post-process CNHI detections. For a given 
object with a given total intensity, the more compact 
it is the brighter the maximum voxel intensity. For 
each threshold a simple cut (a line) in this parameter 
space was used to post-process the CNHI detections. 
The line was adjusted for each threshold until the re- 
fined reliability was greater than 90%. This required 
minimal effort, and is a n example of the ty pe of post- 
processing advocated in lSerra et al.l (|201lT ) . 

The completeness, reliability, refined completeness 
and refined reliability are plotted in Figuresf4]and[5]for 
the PS and ES datacubes as a function of threshold. In 
Figure |4l the inherent limitation of the CNHI source 
finder is taken into account by measuring a 'corrected 
completeness'. The corrected completeness is mea- 
sured by excluding sources in the PS input catalogue 
with a full-width at half-maximum smaller than four 
channels. This corrected completeness therefore mea- 
sures the completeness of the sources that the CNHI is 
expected to find. The refined completeness in Figure 
[4] is that of the corrected completeness. The inherent 
limitation of the CNHI is only taken into account for 
the PS datacube, because a significant fraction of the 
sources in the PS datacube have Gaussian profiles with 
FWHM's that extend over 3 channels or less. Uncer- 
tainties are not provided for the curves in Figures f4] 
and [S] because to generate meaningful uncertainties 
requires generating many noise realisations. This is 
beyond the scope of this paper. It is also irrelevant, 
because the performance of the CNHI source finder on 
the PS and ES datacubes is not a guarantee or guide 
to the performance of the CNHI source finder on other 
datasets. 

As expected, when generating the curves in Fig- 
ures[4]and[5] the choice of pre-threshold has as much of 
an impact on the completeness as the choice of thresh- 
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old. To obtain the best completeness the pre-threshold 
should be set as high as compute resources and the dat- 
acube size will allow. Note that it was also observed 
that a pre-threshold larger than 0.1 or even 0.01 is 
excessive, and needlessly computationally expensive. 
Pre-thresholds larger than 0.1 or 0.01 are unlikely to 
improve upon the number of real sources that are re- 
covered. For this reason a pre-threshold of 0.01 was 
used to generate Figures [4] and [5] 
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Figure 4: Completeness (solid line), reliabil- 
ity (dashed line), corrected completeness (dotted 
line), refined corrected completeness (circles) and 
refined reliability (dot-dash line) curves for the PS 
datacube. 



The curves in Figures [4] and [5] illustrate that it is 
possible to achieve a good combination of refined com- 
pleteness and refined reliability. For the PS dataset it 
was possible to achieve a refined, corrected complete- 
ness of ~ 80% with a refined reliability of ~ 95%. A 
refined completeness of ~ 50% was achieved for the 
ES datacube, also with a refined reliability of ~ 95%. 
For both datasets the curves in Figures 0] and [S] have 
negligible merging and fracture rates. 

Thresholds of 10~~ 3 and 10~ 6 produce arguably the 
optimal combination of refined completeness and re- 
fined reliability in Figures [4] and [5] For this reason, 
the performance of the CNHI source when using these 
thresholds was examined further. First, the complete- 
ness was measured as a function of maximum voxel 
flux. Next, the fraction of total flux recovered by 
the CNHI source finder was measured as a function 
of source total flux. Finally, the distribution of the 
difference between the true position of the sources and 
the position measured by the CNHI source finder was 
determined. 

Figure [5] shows that the CNHI source finder can 
find almost all of the PS objects with a maximum voxel 
intensity five times brighter than the noise, with a high 



I 40 




log 10 threshold 



Figure 5: Completeness (solid line), reliability 
(dashed line), refined completeness (dotted line) 
and refined reliability (dot-dash line) curves for 
the ES datacube. 



refined reliability. The CNHI source finder also finds 
a significant fraction of PS objects with a maximum 
voxel intensity between three and five times the noise. 
Unfortunately, the CNHI source finder does not per- 
form as well at finding the ES objects. Visual inspec- 
tion of the undetected ES objects revealed that they 
are 'pancake' galaxies. These pancake galaxies are spa- 
tially resolved, but due to orientation only extend over 
a few channels. The bundling of Hi spectra acts as a 
crude spatial filter, and the bundling used here closely 
matches the spatial profile of the PS objects. This sug- 
gests that the CNHI source finder can be improved by 
using multiple bundling schemes. This improvement 
would be equivalent to using matched filtering in three 
dimensions, with the spatial and frequency dimensions 
using independent filters. 

The fraction of each source's total intensity that 
is recovered by the source finder is plotted in Figure 
[Jj and is referred to as the recovery fraction. The 
mean recovery fraction is measured after excluding 
fragmented sources, and is overlaid in Figure]?] A poor 
recovery fraction requires post-processing of each de- 
tection by a second tool to improve the source parametri- 
sation. The mean recovery fraction for the 10 -3 thresh- 
old in Figure [Jj asymptotes from ~ 80% to ~ 90% 
as sources become brighter. The recovery fractions 
greater than 100% are a result of comparing total in- 
tensities measured in the noisy datacube to a reference 
total intensity measured in the noise-free datacube. 
This is the most meaningful comparison though, be- 
cause it is sensitive to object masks that are either too 
large or too small. The recovery rate is as good as I 
would expect to do, without over-estimating the total 
flux. This demonstrates that minimal post-processing 
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Figure 6: The refined completeness as a function 
of source maximum voxel intensity. The refined 
completeness of the point sources (solid line) and 
extended sources (dashed line) is shown for thresh- 
olds of 1CT 3 and 1(T 6 . 



is required to improve the parametrisation of sources 
detected by the CNHI source finder. 

Using more conservative thresholds of 1CP 5 and 
10~ 7 , which do not push as far into the noise as higher 
threshold^], results in a median recovery fraction that 
asymptotes from ~ 50% to ~ 90% and ~ 50% to 
~ 70% as sources become brighter. The median re- 
covery fraction appears to asymptote more rapidly for 
more aggressive thresholds. This suggests that as the 
choice of threshold becomes more aggressive, the to- 
tal flux measured for a source will converge to its true 
value. This suggests two things. First, the sources 
with low recovery fractions in the left panel of Figure 
[7] are probably objects that have only just been de- 
tected, and using a more aggressive threshold would 
result in a higher recovery fraction. Second, an alter- 
native use of the CNHI source finder is as a source 
parametrisation tool rather than a source finder. 

The final component of the analysis is the distri- 
bution of the separation in voxels between the centre 
of the sources and their corresponding detection. The 
separation distributions of the voxel centres and in- 
tensity weighted voxel centres are presented in Figure 

m 

A separation between the centre of an object and 
its corresponding CNHI detection reveals biassed source 
detection. The centres of an object and its correspond- 
ing detection are calculated as an unweighted mean of 
the voxel positions or a flux weighted mean of the voxel 
positions. It can be safely assumed that the voxel and 
weighted voxel centres of an object are accurate. A 
non-zero separation therefore arises due to a distorted 
source detection, where one region of the source is de- 
tected more than the rest. If the mean voxel centre sep- 



4 The meaning of 'aggressive' thresholds for the CNHI 
source finder is the reverse of thresholds used in inten- 
sity threshold based source finders. For intensity threshold 
source finders, lower thresholds are more aggressive and 
push further into the noise. 




HQ 



separation (voxels) 



Figure 8: The relative distribution of separations 
between the voxel centres and flux weighted voxel 
centres of sources and their corresponding CNHI 
detections. The separations of the unweighted and 
intensity weighted PS positions is shown by the 
solid and dotted lines. The dashed and dot-dash 
lines show the separations of the unweighted and 
intensity weighted positions for the ES dataset. 



aration is sharply peaked about a separation of zero, 
then we can conclude that CNHI detections are on 
average unbiassed representations of the correspond- 
ing source. If the mean weighted voxel centre sepa- 
ration is similarly distributed, this demonstrates that 
the CNHI detections are on average an unbiassed rep- 
resentation of the source's flux distribution. In Figure 
[5] the unweighted (intensity weighted) separation dis- 
tributions of both the PS and ES datasets are sharply 
peaked at 0.7 (0.3) and 0.9 (0.5) voxels. This demon- 
strates that the CNHI source finder detections are typ- 
ically a fair, unbiassed representation of the underlying 
source and its flux distribution. 

This section finishes by presenting examples of the 
postage stamp images generated by the CNHI source 
finder. Figures l9l and [lOl are postage stamp images for 
two objects selected at random from the PS and ES 
dataset (one from each dataset). Both figures demon- 
strate the ability of the CNHI source finder to find the 
boundaries of sources. Figure [TD] also nicely illustrates 
that the CNHI source finder is capable of detecting ob- 
jects extending over many channels as a single object, 
rather than fragmenting it into two or more detections. 

6 Summary 

WALLABY and other projects that will be carried 
out on next-generation radio telescopes ASKAP and 
MeerKAT herald the start of the data deluge era in 
radio astronomy. The sheer size of WALLABY dat- 
acubes necessitates automation of many tasks in the 
data reduction pipeline, that previously would have 
been carried out with some level of manual input by 
an astronomer. Complete automation of finding Hi 
galaxies in spectral datacubes is one of the challenges 
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Figure 7: The recovery fraction of the total intensity of sources in the PS datacube for thresholds of 10 -3 
(left), 10~ 5 (middle) and 10~ 7 (right). Sources that have been fragmented into multiple detections are 
represented using solid circles instead of hollow circles. The median recovery rate of the non-fragmented 
sources is overlaid as a solid line. The median is measured in bins of size 400 mjy/beam km/s. Note that 
solid circles are only plotted in the left panel. The appearance of solid circles in the middle and right 
panels is due to a high density of hollow circles. 



that is actively being investigated by WALLABY. 

The resolution and size of WALLABY observations 
poses a challenge for many existing automated source 
finders. This challenge arises from the underlying con- 
ceptual framework and algorithm that these source 
finders are based on, and not a flaw in the implementa- 
tion. The CNHI source finder has been developed us- 
ing a conceptual framework that can handle the large 
size of WALLABY datacubes and takes advantage of 
the resolution. Treating a datacube as a set of spec- 
tra (akin to an IFU observation), it attempts to find 
sources by looking for regions in each spectrum that 
do not look like noise. This is achieved using a novel 
implementation of matched filtering. Instead of using 
multiple filters that describe various types of sources, 
a single filter describing the noise is used. Sources are 
detected using this noise filter by identifying regions 
that do not look like noise. 

The performance of the CNHI s ource finder was 
tested using the PS and ES datasets in lWestmeier fc Poppin 
: 201 I ). Analysis of the CNHI source finder output 
demonstrated that a reasonable combination of com- 
pleteness and refined reliability can be achieved. A re- 
fined completeness of ~ 80% and ~ 50% was achieved 
for the PS and ES datasets, respectively, with a re- 
fined reliability of ~ 95%. The PS dataset is better 
than the 80% completeness would suggest though, be- 
cause the CNHI source finder found ~ 95% of all PS 
objects with a maximum voxel flux > 5er, with a re- 
fined reliability of ~ 95%. This analysis also demon- 
strated that the CNHI source finder recovers a signifi- 
cant fraction of the source flux. The recovery fraction 
asymptotes towards 100% as the total flux increases. 
More aggressive thresholds (larger) result in a recov- 
ery fraction that asymptotes faster, and starts higher. 
This suggests an alternative use of the CNHI source 
finder as a source parametrisation tool, that is used 
in tandem with another source finder. Finally, the 
performance analysis demonstrated that CNHI detec- 
tions of a source are an unbiassed representation of the 
source and its flux distribution. These results are very 
promising, and warrant further testing and refinement 
of the CNHI source finder. 



There are three development goals for the CNHI 
source finder. Further development of the CNHI source 
finder will initially focus on incorporating multi-scale 
bundling. This will effectively achieve independently- 
scaled matched filtering in both the frequency and 
spatial dimensions. Additionally, the CNHI source 
finder will have a simple intensity thresholding test 
added to it. Incorporating an intensity thresholding 
test will make the CNHI source finder sensitive to 
sources occupying 3 or fewer channels. The final de- 
velopment goal is to incorporate fourier analysis, poly- 
nomial fitting and existing baseline structure removal 
techniques. Upon completing this next development 
cycle, the CNHI source finder will be tested and tweaked 
using the next round of ASKAP s imulations and the Hi 
Park es All Sky Survey (HIPASS) l|Stavelev-Smith et all 
120001 ) datacubes. The HIPASS datacubes have been 
selected because they have a well defined source cat- 
alogue, contain a mixture of resolved and unresolved 
clources, have known artifacts and calibration issues 
and there is a potential for the CNHI source finder to 
detect new sources. 
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Figure 9: Example postage stamp images pro- 
duced by the CNHI source finder for a PS object 
with a threshold of 10~ 3 . The red cross marks 
the intensity weighted position of the object. The 
boundary of the object is marked by a green 
line. From the top the images are a moment-0 
map, RA position-velocity diagram, Dec position- 
velocity diagram and integrated spectrum. The in- 
tegrated spectrum consists of both the object (red 
line), and an integrated spectrum over the entire 
bounding box of the object (black line). 
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Figure 10: Example postage stamp images pro- 
duced by the CNHI source finder for an ES object 
with a threshold of 10 ~ 6 . The intensity weighted 
position and boundary of the object are marked 
with a red cross and green line. The postage 
stamps are a moment-0 map (top), RA position- 
velocity diagram (second from top), Dec position- 
velocity diagram (second from bottom) and inte- 
grated spectrum (bottom). The red spectrum is 
the object and the black spectrum is a reference 
spectrum of the object's bounding box. 
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